In our previous two projects, we spent a great deal of time predicting income on the US census dataset. In this project, we would like to determine whether the variables we found important in predicting high or low income are also important factors in association rule mining. Please refer to our second lab for more information. Previously, we discovered that the important predictor variables in classifying income level were:
marital status
gender
age
hours per week
occupation
education
In this study we will examine the same dataset through association rule mining. We will not only attempt to confirm that our previously chosen factors are associated with low or high income, but also look deeper into the relationships and examine which levels of each factor are associated with each income bracket.
We will perform a small ARM analysis and a large one, identifying relationships between both income brackets and all the factors present in the dataset. We will then dive deeper and perform ARM analysis on the high and low income brackets individually, in an attempt to determine precisely which factors are associated with which bracket. Finally, we will demonstrate an implementation of a grid search for the apriori algorithm, propose upper and lower bounds for the digestibility of a set of association rules, and walk through our process of selecting the strongest set of rules that can be easily understood.
We chose this dataset from the UCI Machine Learning Repository for its categorical predictive attributes. It contains 1994 data pulled from the US Census database. The prediction task we’ve set forth is to predict whether a person’s 1994 salary exceeded $50K, based on the various categorical and numerical attributes in the census database. The link to the data source is below:
https://archive.ics.uci.edu/ml/datasets/census+income
The effectiveness of our algorithm will be determined by support, confidence, and lift, as these are the metrics that describe how strong the relationship is between the elements within each transaction. Currently, there are no established methods for cross-validation of association rules, although there are some hard-working individuals out there attempting to create such a tool.
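As a refresher, all three measures can be computed directly from item counts. A minimal sketch for a hypothetical rule {A} => {B}; the counts below are made up for illustration and do not come from our dataset:

```r
# Illustrative counts for a hypothetical rule {A} => {B}
n    <- 1000   # total transactions
n_A  <- 300    # transactions containing A
n_B  <- 400    # transactions containing B
n_AB <- 180    # transactions containing both A and B

support    <- n_AB / n               # how often the rule applies: 0.18
confidence <- n_AB / n_A             # P(B | A): 0.6
lift       <- confidence / (n_B / n) # confidence relative to B's base rate: 1.5

c(support = support, confidence = confidence, lift = lift)
```

A lift above 1 means the antecedent and consequent co-occur more often than they would by chance, which is why we sort rulesets by lift throughout this report.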
Here we will discuss each attribute and give some description about its ranges.
First, let’s go ahead and load the necessary libraries:
Loaded Packages:
arules, arulesViz, forcats, dplyr, plotly, data.table,
pander, knitr, skimr, lubridate, ggplot2, cowplot,
foreach, doParallel
Next, let’s import our dataset:
data <- read.csv("./data/adult-training.csv")
The first thing we must do is check whether there are any NAs in our dataset, so they do not interfere with our analysis.
NA_sum <- sort(sapply(data, function(x) sum(is.na(x))), decreasing = TRUE)
data.frame((NA_sum))
Looks like we are doing OK here. The next issue with the dataset is that, because of the way the CSV was stored, some of the levels in our factors include leading and trailing whitespace. This is highly undesirable, so we must clean it up:
GetFactors <- function(df) {
    return(names(Filter(is.factor, df)))
}
FixLevels <- function(x) {
    levels(x) <- trimws(levels(x))
    return(x)
}
data[GetFactors(data)] <- lapply(data[GetFactors(data)], FixLevels)
pander(lapply(data[GetFactors(data)], levels))
Next, we need to re-encode some of our data. First, let’s collapse the education levels into larger groups (for example, grades 1st through 12th should be a single No Diploma level, not a bunch of separate levels).
data$education <- fct_collapse(data$education, `No Diploma` = c("1st-4th", "5th-6th",
"7th-8th", "9th", "10th", "11th", "12th", "Preschool"), Associates = c("Assoc-acdm",
"Assoc-voc"), Diploma = c("Some-college", "HS-grad"))
Then the income brackets:
data$income_bracket <- fct_collapse(data$income_bracket, small = "<=50K", large = ">50K")
Next, let’s change the “?” levels to something more useful:
levels(data$workclass)[levels(data$workclass) == "?"] <- "Other"
levels(data$occupation)[levels(data$occupation) == "?"] <- "Other-service"
levels(data$native_country)[levels(data$native_country) == "?"] <- "Other"
data[GetFactors(data)] <- lapply(data[GetFactors(data)], FixLevels)
pander(lapply(data[GetFactors(data)], levels))
Next, let’s remove the fnlwgt, education number, and capital gain and loss columns, as they are unneeded. We also use the cut function to factorize our numeric variables:
data <- data[, -c(3, 5, 11:12)]
data$age <- cut(data$age, breaks = c(15, 25, 45, 65, 100), labels = c("Young",
"Middleaged", "Senior", "Retired"))
data$hours_per_week <- cut(data$hours_per_week, breaks = c(0, 20, 40, 60, 80),
labels = c("part-time", "full-time", "hard-working", "need-a-life"))
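As a sanity check on how cut() assigns these labels, here is a toy example (the ages below are made up). Note that cut() uses right-closed intervals by default, so a boundary value such as 25 falls in the lower bin:

```r
# Toy ages, binned with the same breaks and labels as our age column
ages <- c(18, 25, 30, 50, 70)
binned <- cut(ages, breaks = c(15, 25, 45, 65, 100),
              labels = c("Young", "Middleaged", "Senior", "Retired"))
as.character(binned)
# "Young" "Young" "Middleaged" "Senior" "Retired"
```

Also note that any age of 15 or below would become NA, since the lowest break is exclusive unless include.lowest = TRUE is set.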
library(magrittr)
data %>% skim_to_list() %$% factor %>% select(-c("ordered", "missing", "complete")) %>%
kable
| variable | n | n_unique | top_counts |
|---|---|---|---|
| age | 32561 | 4 | Mid: 16523, Sen: 8469, You: 6411, Ret: 1158 |
| education | 32561 | 7 | Dip: 17792, Bac: 5355, No : 4253, Ass: 2449 |
| gender | 32561 | 2 | Mal: 21790, Fem: 10771, NA: 0 |
| hours_per_week | 32561 | 4 | ful: 20052, har: 8471, par: 2928, nee: 902 |
| income_bracket | 32561 | 2 | sma: 24720, lar: 7841, NA: 0 |
| marital_status | 32561 | 7 | Mar: 14976, Nev: 10683, Div: 4443, Sep: 1025 |
| native_country | 32561 | 42 | Uni: 29170, Mex: 643, Oth: 583, Phi: 198 |
| occupation | 32561 | 14 | Oth: 5138, Pro: 4140, Cra: 4099, Exe: 4066 |
| race | 32561 | 5 | Whi: 27816, Bla: 3124, Asi: 1039, Ame: 311 |
| relationship | 32561 | 6 | Hus: 13193, Not: 8305, Own: 5068, Unm: 3446 |
| workclass | 32561 | 9 | Pri: 22696, Sel: 2541, Loc: 2093, Oth: 1836 |
We’d also like to get a quick feel for the dataset through some visualizations.
p1 <- ggplot(data) + geom_bar(aes(x = age, fill = income_bracket), stat = "count") +
labs(x = "Age", y = "Count", title = "Age by Income", subtitle = "Histogram plot")
p1
Our first plot shows a quick histogram of age groups by income bracket. Not surprisingly, the majority of the “large” income group falls in the middle-aged and senior groups, supporting the common knowledge that as you get older, you tend to make more money.
p2 <- ggplot(data, aes(x = education, fill = income_bracket, color = income_bracket)) +
geom_bar(alpha = 0.9, position = "fill") + coord_flip() + labs(x = "Education",
y = "Proportion", title = "Income bias based on Education", subtitle = "Stacked bar plot")
p2
Our next plot is a stacked bar chart of income bracket proportions by education. We see here that higher education (Bachelors, Masters, Doctorate) yields a higher proportion of the “large” income bracket. Professional school is roughly equal with Doctorate, suggesting specialized trades and higher education are the best bets for making over $50K a year.
p3 <- ggplot(data, aes(x = marital_status, fill = income_bracket, color = income_bracket)) +
geom_bar(alpha = 0.9, position = "fill") + coord_flip() + labs(x = "Marital Status",
y = "Proportion", title = "Income bias based on Marital status", subtitle = "Stacked bar plot")
p3
Using a similar plot style as above, we’ll now look at income bracket by marital status. To no surprise to us researchers, married couples have a distinct advantage over those that are unmarried. In fact, the next highest categories are Widowed and Divorced. In other words, if you want to make more money in life, you should at least give marriage a try.
p4 <- ggplot(data, aes(x = occupation, fill = income_bracket, color = income_bracket)) +
geom_bar(alpha = 0.9, position = "fill") + coord_flip() + labs(x = "Occupation Status",
y = "Proportion", title = "Income bias based on Occupation status", subtitle = "Stacked bar plot")
p4
Next, we’ll compare income brackets by occupation. This revealed a few obvious, but also a few interesting, results. Exec-managerial and Prof-specialty were the highest categories, to no one’s surprise. Protective-serv, Transport-moving, and Tech-support also show good proportions for making over $50K. Some of the lowest categories, though, were Armed-Forces and house-cleaning/services.
p6 <- ggplot(data, aes(occupation)) + geom_bar(aes(fill = education), width = 0.5) +
theme(axis.text.x = element_text(angle = 60, vjust = 0.5)) + labs(title = "Histogram of occupation with education binning",
subtitle = "Occupation and Educational")
p6
Lastly, we combine education and occupation to see if we can notice any particular relationships between the two groups. What kinds of education do the people with large incomes have? Exec-managerial shows a large number of bachelor’s and master’s degrees, which means more school definitely helps in that occupation. Prof-specialty shows the widest range of educational sectors represented, but we found it interesting that Prof-school wasn’t a larger representation here. This is probably because Prof-specialty is a wider category than some, in that you can be a professional in a lot of different occupational sectors.
Armed-Forces has an extremely low amount of education information, which also makes sense, as most members are recruited right out of high school. We would have expected to see a higher number of Diplomas in this group, but perhaps the information is simply missing from this dataset.
Finally, we can convert our dataset into a transactions object, the proper data format for the apriori algorithm:
data <- as(data, "transactions")
summary(data)
#> transactions as itemMatrix in sparse format with
#> 32561 rows (elements/itemsets/transactions) and
#> 102 columns (items) and a density of 0.1077805
#>
#> most frequent items:
#> native_country=United-States race=White
#> 29170 27816
#> income_bracket=small workclass=Private
#> 24720 22696
#> gender=Male (Other)
#> 21790 231771
#>
#> element (itemset/transaction) length distribution:
#> sizes
#> 10 11
#> 208 32353
#>
#> Min. 1st Qu. Median Mean 3rd Qu. Max.
#> 10.00 11.00 11.00 10.99 11.00 11.00
#>
#> includes extended item information - examples:
#> labels variables levels
#> 1 age=Young age Young
#> 2 age=Middleaged age Middleaged
#> 3 age=Senior age Senior
#>
#> includes extended transaction information - examples:
#> transactionID
#> 1 1
#> 2 2
#> 3 3
Option B: Association Rule Mining
• Create frequent itemsets and association rules.
• Use tables/visualization to discuss the found results.
• Use several measure for evaluating how interesting different rules are.
• Describe your results. What findings are the most compelling and why?
Before we begin with our analysis, let’s check out the item frequencies within the dataset. We are looking for items with support >= 0.2:
itemFrequencyPlot(data, support = 0.2)
Next, let’s mine some rules with the apriori algorithm and then clean up redundant rules. For this first pass, we set supp = 0.2 and conf = 0.15.
zerules <- apriori(data, parameter = list(minlen = 2, supp = 0.2, conf = 0.15),
appearance = list(rhs = c("income_bracket=small", "income_bracket=large"),
default = "lhs"), control = list(verbose = F))
length(zerules)
#> [1] 91
redundant <- is.redundant(zerules)
zerules.pruned <- zerules[redundant == FALSE]
rulesorted <- sort(zerules.pruned, by = "lift", decreasing = TRUE)
length(rulesorted)
#> [1] 25
Next, let us inspect the rules and examine their quality:
(quality(rulesorted))
inspectDT(rulesorted, caption = "Association Rules")
plot(rulesorted, method = "scatterplot", measure = c("support", "lift"), shading = "confidence",
engine = "htmlwidget")
Scatter Plot
This plot shows us each rule, with x axis as support and Y axis as lift. They are colored by confidence. This makes a nice little view of which rules were made from the most data. These may not always be the most interesting rules, but they are definitely the strongest ones. From looking at this, we can note a few strong, interesting associations.
Females are associated with low income bracket
Not being married is strongly associated with a low income bracket
High school diploma only is associated with low income
Private/housekeeping/server jobs are associated with low income
No family is also associated with low income
These strong rules are by no means earth-shattering, but they do tell us that we are looking in the right direction, as they agree with the (sometimes unfortunate) realities of the world we live in.
plot(rulesorted, method = "graph", measure = "confidence", shading = "lift",
engine = "htmlwidget")
Balloon Plot
The balloon plot can filter and zoom, which gives us the ability to look through each income bracket and see which rules are feeding the analysis. Just by separating out the large income bracket, we begin to see that Married-civ-spouse has one of the highest lift scores. The next two highest were White and Male, which follows suit with our earlier research as well.
In the small income bracket, we see rules such as (Never-married, full-time), (Female, Private/housekeeper/server), and (Private/housekeeper/server, Never-married). This begins to tell us that if you’re female, never married, and work full time, then your chances of making under $50K are higher.
plot(rulesorted, method = "paracoord", measure = "confidence", shading = "lift",
control = list(reorder = T))
Parallel Plot
The width of the arrows represents support and the color represents confidence. We can see a few interesting things here. First of all, we have a striking level of confidence that being married gets you a large income, and then also interestingly we see that working as a private worker/housekeeper/server is a part of a lot of the rules related to being in the low income bracket (You can see this by looking where all the rules converge on LHS order 1).
This plot is often used to discuss the relationship between the order of the rules and their support/confidence.
plot(rulesorted, method = "two-key plot", measure = "lift", shading = "confidence",
engine = "htmlwidget")
Two Key Plot
We can find several interesting bits of information here. First, we see that we only produced one rule of order 4, a few rules of order 3, and a bunch of order 2. We also note that the area with high support and confidence (upper right, a bit of a sweet spot), is dominated by the shorter rules.
After our first attempt at mining for rules, we’re going to relax our support to 0.1 and increase our confidence to 98% and see what we get. Afterwards, we’ll target small and large incomes individually to see if we can isolate the itemsets with the highest influence.
rule2 <- apriori(data, parameter = list(minlen = 2, supp = 0.1, conf = 0.98),
appearance = list(rhs = c("income_bracket=small", "income_bracket=large"),
default = "lhs"), control = list(verbose = F))
length(rule2)
#> [1] 50
redundant <- is.redundant(rule2)
rulep <- rule2[redundant == FALSE]
rulesorted2 <- sort(rulep, by = "lift", decreasing = TRUE)
length(rulesorted2)
#> [1] 36
We can see that this time around we’ve generated 36 rules. We’ll put them through the same inspection and plots as before to see if any other rules stand out, or if we’re seeing more of the same.
head(quality(rulesorted2))
inspectDT(rulesorted2, caption = "Alternate Association Rules")
plot(rulesorted2, method = "scatterplot", measure = c("support", "lift"), shading = "confidence",
engine = "htmlwidget")
We see similar results as in section 7.2.1.
Never being married has a strong association with low income.
High school Diploma is a lower income indicator
Private jobs also carry an association with low income.
One difference was that Female was NOT found to carry an association with low income within these rules; we are unsure why.
But we also noticed that two other itemsets began to surface around the highest lift.
Young was seen more often within the higher-lift rules, meaning that, yes, young people don’t make as much money.
The Own-child relationship also had a strong association with low income. Note that in this dataset, Own-child indicates that the person is a child of the householder, i.e., still living at home.
What was also interesting to see is that one of the rules with the highest lift was (Young, Never-married, Own-child), which describes young people who have not yet left the nest and are early in their working lives.
plot(rulesorted2, method = "graph", measure = "confidence", shading = "lift",
engine = "htmlwidget")
Here the large income bracket didn’t appear in our graph, which explains why we didn’t see any large-income factors in the scatter plot. This chart does show us similarities with what we saw for small income: some of the strongest itemsets for low income were (Young, Never-married, Own-child, Diploma, Private, White, United-States), with the strongest (Rule 1) being (Young, Never-married, Own-child).
plot(rulesorted2, method = "two-key plot", measure = "confidence", shading = "lift",
engine = "htmlwidget")
Here we see how expanding our parameters changes the itemset levels. With the higher confidence level, we’ve gotten rid of more of the two-item rules but increased the number of higher-order itemsets. For instance, we now have five 5-item rulesets with different combinations of all the factors we saw in the previous plot (Young, Never-married, Own-child, Diploma, Private, White, United-States).
Now that we’ve scanned a couple of different variations of apriori and generated rules accordingly, let’s split the targeting in the RHS to only one income bracket at a time. First, we’ll start with low income.
rule_small <- apriori(data, parameter = list(minlen = 2, supp = 0.12, conf = 0.95),
appearance = list(rhs = c("income_bracket=small"), default = "lhs"), control = list(verbose = F))
length(rule_small)
#> [1] 45
redundant <- is.redundant(rule_small)
rulep <- rule_small[redundant == FALSE]
rulesorted_small <- sort(rulep, by = "lift", decreasing = TRUE)
length(rulesorted_small)
#> [1] 25
head(quality(rulesorted_small))
inspectDT(rulesorted_small)
It looks like we’ve generated 25 rules for the low income bracket. Let’s take a look at the scatter plot and find our relevant itemsets/rules:
plot(rulesorted_small, method = "scatterplot", measure = c("confidence", "support"),
shading = "lift", engine = "htmlwidget")
We will discuss the results from small income targeting in the conclusion section.
Our next step is to target the large income bracket and see what rules shake out:
rule_large <- apriori(data, parameter = list(minlen = 2, supp = 0.05, conf = 0.55),
appearance = list(rhs = c("income_bracket=large"), default = "lhs"), control = list(verbose = F))
length(rule_large)
#> [1] 53
redundant <- is.redundant(rule_large)
rulep <- rule_large[redundant == FALSE]
rulesorted_large <- sort(rulep, by = "lift", decreasing = TRUE)
length(rulesorted_large)
#> [1] 15
head(quality(rulesorted_large))
inspectDT(rulesorted_large)
It looks like we’ve generated 15 rules for the large income bracket. Let’s take a look at the scatter plot and find our relevant itemsets/rules:
plot(rulesorted_large, method = "scatterplot", measure = c("confidence", "support"),
shading = "lift", engine = "htmlwidget")
We will discuss the results from large income targeting in the conclusion section.
Interesting small Income rules:
Not being married is associated with a low income. Out of our 25 best rules, 16 involved not being married, and our five highest-lift rules all involved Never-married. This is clearly not a coincidence; however, what this means and what causes it are not so clear. We propose a few hypotheses:
Married people, motivated by love, work harder and make more money
Shallow society deems people without money harder to marry
People who are married got married because they are more social, and thus they are given more raises, perform better in interviews, and so on. In contrast, people who were not sociable enough to meet someone may not perform as well in interviews, giving them fewer opportunities to get a job
Having just a high school diploma does not get you out of the lower income bracket. This is unsurprising, as today you generally need at least a college education to start making good money (and even that is difficult). 5 out of our 25 rules involved highest education level = high school diploma
12 out of 25 rules involved being young. This makes perfect sense, as entry level jobs which young people get do not pay very well.
8 out of 25 rules involved being in the private/housekeeping/server field. This makes sense, as these workers are severely underpaid
4 out of 25 rules involved the Own-child relationship. Since Own-child indicates a child of the householder, these are mostly young people still living at home and early in their careers, so lower earnings make sense.
The rules for our small income target have a support of at least 12% and a confidence of at least 95%. This means that each rule’s full left-hand side (combinations of Never-married, Diploma, Young, Private/housekeeper/server, and Own-child) is observed in at least 12% of the dataset, and whenever a rule’s full left-hand side is present, the income is small at least 95% of the time. Note that confidence applies to the complete left-hand side of a rule, not to any single attribute on its own.
Interesting large Income rules:
Thankfully, in contrast to the small income analysis, large income targeting shows that being married greatly influences your income, as it shows up in 9 of the 15 rules present. Meaning, again, that by working as a team, you get more out of life. So go find yourself a husband/wife.
The next most frequent itemset is hard-working, which shows up in 8 of the 15 rules. It makes perfect sense that the more time you put into a job, the likelier you are to earn promotion and advancement within an organization.
Education = Bachelors, native_country = United-States, and relationship = Husband are each present in 5 out of the 15 rules.
The rules for our large income target have a support of at least 5% and a confidence of at least 55%. This means that each rule’s full left-hand side (combinations of married, hard-working, Bachelors, United-States, and Husband) is observed in at least 5% of the dataset, and whenever a rule’s full left-hand side is present, the income is large at least 55% of the time.
Be critical of your performance and tell the reader how you current model might be usable by other parties. Did you achieve your goals? If not, can you reign in the utility of your modeling?
How useful is your model for interested parties (i.e., the companies or organizations that might want to use it)?
How would you deploy your model for interested parties?
What other data should be collected?
How often would the model need to be updated, etc.?
Our current model wouldn’t be a good option for a production environment. Granted, there are astounding amounts of transactional data out there that can benefit from apriori and association rule mining, but due to the trial and error used to isolate various rules and itemsets, we’re not confident that a new dataset would show similar behavior or similar itemsets. It is probably best used for the 1994 data on which it was created.
Overall, I believe our group achieved the goals we set forth. We learned the process of association rule mining, discovered new trends, and validated those previously seen throughout our analysis of this dataset. We’re satisfied that we’ve created a useful model for anyone looking to analyze census data.
Interested parties for this dataset could be activists, feminists, political campaigns, or others looking to change public policy through validation of the population’s characteristics. The model would theoretically work for other years of census data (with light variable tweaking), and it could also be applied to look for time-series trends.
Since our model doesn’t require long run times, one method would be to deploy an endpoint REST API. That way users could query for a prediction on income, then get the results back quickly and easily. Maintenance would be fairly low, so hosting it on an AWS instance might prove frugal and functional.
Other data would consist of supplying other years of census data to see if similar rulesets appear over time. Pairing this dataset with economic information could also prove beneficial, as I’m sure the economy influences the unemployment rate, salaries in various industries, and a multitude of other factors that affect people’s income.
Since we’re dealing with Census data, a yearly update of the model would be sufficient to maintain its usefulness.
For our exceptional work, we decided to try to optimize the number of rules produced by our apriori algorithm with the RHS being both low and high income. We further optimized by finding the ruleset with the highest average lift while remaining digestible. To do that, we implemented a fully parallelized grid search.
There are two parameters to optimize in the apriori algorithm: confidence and support. Normally, these are chosen using in-depth knowledge of the data combined with an understanding of the goals of the study. Because this research is purely academic and we are new to the apriori algorithm, we do not have prior knowledge to tell us which support and confidence will be most useful. So, we must optimize both. We will begin by creating two objects, conf and supp, for confidence and support respectively. Each will be a vector from 0.01 to 0.99:
x <- 1:99
conf <- supp <- x/100
Next, we will create a data frame containing all possible combinations of support and confidence:
init <- data.frame(conf = conf, supp = supp)
params <- expand.grid(init)
nrow(params)
#> [1] 9801
Next, we would like to create a function that we can iterate through. The function should take in a value of support and confidence (with appearance defaults for our specific case) and run them through the apriori algorithm. It should then prune redundant rules, get the length and average lift of the ruleset, and return those in a conveniently labeled data frame:
transactions <- data
search <- function(sup, con) {
rules <- apriori(data = data, parameter = list(minlen = 2, conf = con, supp = sup),
appearance = list(rhs = c("income_bracket=small", "income_bracket=large"),
default = "lhs"), control = list(verbose = F))
redundant <- is.redundant(rules)
rulepruned <- rules[redundant == FALSE]
rulelength <- length(rulepruned)
return(data.frame(support = sup, conf = con, nrules = rulelength, avlift = mean(quality(rulepruned)$lift)))
}
We will perform the grid search in parallel, in two ways: using R’s built-in parallel library with a FORK cluster (not available on Windows), and using doParallel and foreach with a socket cluster (cross-platform).
parallel: For this, we will use mcmapply (multicore mapply). This takes a function and any number of vectors of equal length, feeds those vectors into the function, and runs it over the specified number of cores. It outputs a list, which we will turn into a data frame:
library(parallel)
gs <- mcmapply(search, params[[1]], params[[2]], mc.cores = 12L, SIMPLIFY = F)
results <- do.call(rbind, gs)
doParallel: We can use the amazing doParallel library to run our R code in parallel without process forking. Instead, it connects to your computer’s CPUs using sockets, which work on all OSes. Because we are not using convenient process forking, we will have to do a little more work to set things up:
library(doParallel)
library(foreach)
numCores <- parallel::detectCores()
workers <- makeCluster(numCores - 1)
invisible(clusterEvalQ(workers, library(arules)))
registerDoParallel(workers)
Now that we are set up with our cluster of CPUs, we can evaluate our grid search in parallel. Note here, we can specify how we combine our outputs. In this case we are going to combine them using rbind, to give the exact same results as our results object.
results <- foreach(i = 1:nrow(params), .combine = rbind) %dopar% {
search(params[i, 1], params[i, 2])
}
We now must define what it means to be digestible. The authors propose that most readers can grasp the main idea, or overall pattern, in a set of 20-30 rules. So we will find the parameter combinations that give us that, and among those, the rulesets with the highest lift. We will do so using dplyr:
results %>% arrange(desc(avlift)) %>% filter(nrules <= 30) %>% filter(nrules >=
20) %>% filter(support > 0.15) %>% filter(conf > 0.15)
To select the proper support and confidence for optimal digestibility, we will select the combination with the highest average lift:
optimalRules <- apriori(data, parameter = list(conf = 0.18, supp = 0.32), appearance = list(rhs = c("income_bracket=small",
"income_bracket=large"), default = "lhs"), control = list(verbose = F))
sortedOptimalRules <- sort(optimalRules, by = "lift", decreasing = TRUE)
inspectDT(sortedOptimalRules)